LOADING THE PACKAGES

In [1]:
# Data Manipulation and Representation
import pandas as pd
import numpy as np

# Statistical Analysis
import scipy.stats as stats
import statsmodels.api as sm
import pingouin as pg
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Web Interaction and Display
from IPython.display import Image, display, HTML

# Visualization
import matplotlib.pyplot as plt

# Miscellaneous
import warnings
warnings.filterwarnings("ignore")


# Database Access, Plotting, and Encoding Detection
import sqlite3
import seaborn as sns
import chardet

# Hypothesis Tests
from scipy.stats import ttest_rel, f_oneway

# Additional JavaScript for toggling code display in Jupyter Notebooks
HTML(
    """
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script>
<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
 } else {
 $('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
    value="Click here to toggle on/off the raw code."></form>
"""
)
Out[1]:

title.png

ABSTRACT

This study explores the relationship between Gross Domestic Product (GDP) and CO2 emissions across various income brackets, testing the Environmental Kuznets Curve (EKC) hypothesis. Utilizing regression models and statistical analysis, it was found that high-income countries have significantly higher CO2 emissions per capita, indicating a strong correlation between economic prosperity and environmental impact. The analysis reveals that GDP is a crucial predictor of CO2 emissions in both developed and developing nations, with a more pronounced effect in the latter. These findings suggest that economic growth alone may not lead to environmental improvement, challenging the inevitability of the EKC hypothesis. Based on these insights, the study recommends the implementation of differential carbon taxes based on income brackets, the promotion of technology transfer to developing countries, and the establishment of international standards for sustainable development. These policies aim to align economic development with environmental sustainability, addressing the nuanced dynamics of global economic and environmental interactions.

INTRODUCTION

Background

The Paris Agreement is a legally binding international treaty on climate change, adopted by 196 Parties at the UN Climate Change Conference in Paris, France, on December 12, 2015. A crucial aspect of the accord is finance: it reaffirms that wealthier nations should take the lead in providing financial assistance to countries that are less affluent and more vulnerable. One rationale for this position is the assumption that developed nations have generated the majority of worldwide CO2 emissions.

A theory relevant to this subject is the Environmental Kuznets Curve (EKC). The EKC posits that economic growth initially leads to environmental deterioration, but that once a certain level of income is reached, society begins to improve its relationship with the environment and environmental degradation declines.

This hypothesis faces substantial criticism because nothing guarantees that economic expansion will lead to an improved environment. For most developed countries the opposite is often the case; at the very least, targeted policies and attitudes are required to make economic growth compatible with an improving environment.
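
A common reduced-form way to write the EKC is given here purely as an illustrative sketch, not as the specification estimated in this study. It regresses emissions per capita $E_i$ of country $i$ on income per capita $Y_i$ and its square. The symbols $\beta_0$, $\beta_1$, $\beta_2$ and $\varepsilon_i$ are introduced only for this illustration.

```latex
E_i = \beta_0 + \beta_1 \ln Y_i + \beta_2 (\ln Y_i)^2 + \varepsilon_i,
\qquad
Y^{*} = \exp\!\left(-\frac{\beta_1}{2\beta_2}\right)
```

Under this form, an inverted U requires $\beta_1 > 0$ and $\beta_2 < 0$, in which case $Y^{*}$ is the income level at the turning point. The expression for $Y^{*}$ follows from setting the derivative of $E_i$ with respect to $\ln Y_i$ to zero.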

Problem Statement

This study aims to explore the relationship between GDP and CO2 emissions across different income brackets, focusing on the validity of the Environmental Kuznets Curve (EKC) hypothesis. It investigates whether economic growth leads to increased environmental degradation, as represented by CO2 emissions, before a turning point is reached where further economic development corresponds with environmental improvement. Through statistical analysis and regression models, the research seeks to discern the patterns of CO2 emissions in developed and developing countries, thereby shedding light on the global dynamics of economic development and environmental sustainability.

Objectives

The following are the objectives of this study.

  1. Understand the Relationship Between GDP Growth and CO2 Emissions:
    Evaluate the correlation between economic growth (GDP) and changes in CO2 emissions in the context of the Environmental Kuznets Curve hypothesis.

  2. Assess the Environmental Impact of COVID-19:
    Investigate the statistical significance of the COVID-19 pandemic's impact on global CO2 emissions and the GDP of various countries.

  3. Analyze Disparities in GDP Per Capita Across Income Brackets:
    Examine variances in GDP per capita among countries from different income brackets and their relation to CO2 emissions.

  4. Develop Predictive Models for CO2 Emissions:
    Construct and refine linear regression models to predict CO2 emissions, incorporating economic indicators, and optimize these models through feature selection.

Ultimately, drawing on the insights gained from these objectives, the team aims to offer recommendations for international policy that would help policymakers balance economic growth with environmental sustainability, in line with the goals of the Paris Agreement.

Methodology

In this statistics case project, our methodology focuses on analyzing environmental and economic data. We start by processing the data: selecting specific countries, merging datasets, and preparing them for the Isolated Regression Models. Next, we create a user-friendly pipeline of regression model functions. Our statistical analysis then addresses key questions about whether the Earth is healing, COVID-19's impact on GDP, and the variance of GDP per capita across income brackets, using t-tests and ANOVA for hypothesis testing. Finally, we conduct Linear Regression Analysis to predict CO2 emissions, involving model development, feature selection, and refinement, complemented by plots of residuals, observed vs. fitted values, and Normal Q-Q plots.

Step-by-Step Process:

  1. Data Processing: Select specific countries for inclusion in the study. Merge various datasets to create a comprehensive and unified dataset. Prepare the data for application in Isolated Regression Models.

  2. Pipeline Creation: Develop a pipeline that facilitates the ease of use of the required functions for regression analysis.

  3. Statistical Analysis: Perform hypothesis testing, including T-Tests (Related Samples) and ANOVA. Address key questions:

    • Is Earth healing?
    • Did COVID-19 significantly impact the GDP of all countries?
    • Is there a significant variance in GDP per capita across different income brackets?

  4. Linear Regression Analysis:
    • Develop a full model to predict CO2 emissions.
      • Conduct feature selection to optimize and refine the model.
      • Generate an improved model based on the selected features.
      • Visualize the analysis through plots for residuals, observed vs. fitted values, and the Normal Q-Q Plot.
    • Develop separate full models for developed and developing countries.
      • Apply the same subsequent steps as for the full model involving all countries.
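
The two hypothesis tests named in step 3 can be sketched as follows. This is a minimal illustration on synthetic numbers, not the study's data: `gdp_2019`, `gdp_2020`, and the three bracket samples are hypothetical arrays generated only to show the calls to `scipy.stats.ttest_rel` and `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.stats import ttest_rel, f_oneway

rng = np.random.default_rng(0)

# Hypothetical paired observations: the same 30 countries measured in two years
gdp_2019 = rng.normal(loc=100.0, scale=10.0, size=30)
gdp_2020 = gdp_2019 - rng.normal(loc=3.0, scale=1.0, size=30)

# Related-samples t-test: H0 is that the mean paired difference is zero
t_stat, t_pvalue = ttest_rel(gdp_2019, gdp_2020)

# Hypothetical GDP-per-capita samples for three income brackets
low = rng.normal(loc=1.0, scale=0.3, size=20)
middle = rng.normal(loc=5.0, scale=1.0, size=20)
high = rng.normal(loc=40.0, scale=8.0, size=20)

# One-way ANOVA: H0 is that all bracket means are equal
f_stat, f_pvalue = f_oneway(low, middle, high)

print(f"paired t-test: t={t_stat:.2f}, p={t_pvalue:.3g}")
print(f"one-way ANOVA: F={f_stat:.2f}, p={f_pvalue:.3g}")
```

The paired test suits the year-over-year questions because each country appears in both years. ANOVA suits the bracket comparison because the groups contain different countries.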

DATA DESCRIPTION

The Data.csv is a tabular data structure with several columns, including "Country Name," "Country Code," "Series Name," "Series Code," and data for the years 2019 and 2020. It contains information related to carbon dioxide (CO2) emissions, Gross Domestic Product (GDP), population, and urban land area for various countries. Here's a breakdown of the columns:

| Column Name | Description |
|---|---|
| Country Name | The name of the country. |
| Country Code | A unique code assigned to each country. |
| Series Name | Descriptive name of the economic or environmental indicator. |
| Series Code | A code associated with the series, possibly used for identification. |
| 2019 [YR2019] | Data for the year 2019 related to the specified series. |
| 2020 [YR2020] | Data for the year 2020 related to the specified series. |

Table 1. Data CSV Features

The mapper_df.csv dataset consists of two columns: "Country Name" and "Income Bracket." The "Country Name" column contains the names of various countries, while the "Income Bracket" column assigns each country to an income bracket using the abbreviations "L" (Low), "LM" (Lower Middle), "UM" (Upper Middle), and "H" (High). Here's a breakdown of the columns:

| Column Name | Description |
|---|---|
| Country Name | The name of the country. |
| Income Bracket | The income bracket classification assigned to each country. |

Table 2. Mapper CSV Features for income bracket

The renewable.csv is a table containing information on global electricity generation from various sources for the years 1971 to 2021. Here's a breakdown of the columns:

| Column Name | Description |
|---|---|
| Entity | The geographical or political entity for which the electricity generation data is reported. It includes both individual regions like "Africa" and a global aggregate labelled "World." |
| Code | A code or identifier associated with each entity. |
| Year | The year for which the electricity generation data is recorded. The data spans from 1971 to 2021. |
| Geo Biomass Other - TWh | Electricity generation in terawatt-hours from sources categorized as "Geo Biomass Other." |
| Solar Generation - TWh | Electricity generation in terawatt-hours from solar sources. |
| Wind Generation - TWh | Electricity generation in terawatt-hours from wind sources. |
| Hydro Generation - TWh | Electricity generation in terawatt-hours from hydroelectric sources. |

Table 3. Renewable CSV Features
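
Data.csv stores one row per country and series, so most of the processing below amounts to reshaping it into one row per country. A minimal sketch of that reshaping follows, on a four-row excerpt in the same layout. The numeric values are the Afghanistan figures shown in this notebook, while the `..` entries in the last row are assumed for illustration. `sample` and `wide_2019` are hypothetical names.

```python
import io
import pandas as pd

# Four-row excerpt in the layout of Data.csv; ".." marks a missing value
raw = io.StringIO(
    "Country Name,Country Code,Series Name,Series Code,2019 [YR2019],2020 [YR2020]\n"
    "Afghanistan,AFG,CO2 emissions (kt),EN.ATM.CO2E.KT,11238.83,8709.47\n"
    "Afghanistan,AFG,GDP (constant 2015 US$),NY.GDP.MKTP.KD,22071985906.2168,21553051296.9328\n"
    'Afghanistan,AFG,"Population, total",SP.POP.TOTL,37769499,38972230\n'
    "Afghanistan,AFG,Urban land area (sq. km),AG.LND.TOTL.UR.K2,..,..\n"
)

# na_values turns ".." into NaN at load time, so the year columns are numeric
sample = pd.read_csv(raw, na_values="..")

# Long to wide: one row per country, one column per series
wide_2019 = sample.pivot(index="Country Name",
                         columns="Series Name",
                         values="2019 [YR2019]")
print(wide_2019.T)
```

Reading `..` as NaN up front removes the need for the string comparisons (`!= '..'`) and explicit `astype(float)` calls, at the cost of having to handle NaN explicitly later.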

DATA PROCESSING

In [2]:
df = pd.read_csv('Data.csv')
In [3]:
# Import Country Classification per Income Bracket
mapper = pd.read_excel('Income_Bracket.xlsx',
                       sheet_name='Country Analytical History',
                       skiprows=range(11),
                       index_col=0,
                       header=None, usecols=[1, 34]).to_dict()[34]
In [4]:
country_list = ['Afghanistan', 'Norway', 'Mozambique', 'Myanmar', 'Namibia',
                'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand',
                'Nicaragua', 'Niger', 'Nigeria', 'North Macedonia',
                'Northern Mariana Islands', 'Montenegro', 'Oman', 'Pakistan',
                'Palau', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru',
                'Philippines', 'Poland', 'Portugal', 'Morocco', 'Mongolia',
                'Madagascar', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya',
                'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao SAR, China',
                'Malawi', 'Monaco', 'Malaysia', 'Maldives', 'Mali', 'Malta',
                'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico',
                'Micronesia, Fed. Sts.', 'Moldova', 'Puerto Rico', 'Qatar',
                'Uganda', 'Switzerland', 'Syrian Arab Republic', 'Tajikistan',
                'Tanzania', 'Thailand', 'Timor-Leste', 'Togo', 'Tonga',
                'Trinidad and Tobago', 'Tunisia', 'Turkiye', 'Turkmenistan',
                'Turks and Caicos Islands', 'Tuvalu', 'Ukraine', 'Romania',
                'United Arab Emirates', 'United Kingdom', 'United States',
                'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, RB', 'Viet Nam',
                'Virgin Islands (U.S.)', 'West Bank and Gaza', 'Yemen, Rep.',
                'Zambia', 'Sweden', 'Suriname', 'Sudan', 'Russian Federation',
                'Rwanda', 'Samoa', 'San Marino',
                'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone',
                'Singapore', 'Sint Maarten (Dutch part)', 'Slovak Republic',
                'Slovenia', 'Solomon Islands', 'Somalia',
                'South Africa', 'South Sudan', 'Spain', 'Sri Lanka',
                'St. Kitts and Nevis', 'St. Lucia', 'St. Martin (French part)',
                'St. Vincent and the Grenadines', 'Burkina Faso', 'Cabo Verde',
                'Cambodia', 'Cameroon', 'Canada', 'Cayman Islands',
                'Central African Republic',
                'Chad', 'Channel Islands', 'Chile', 'China', 'Colombia', 'Comoros',
                'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica', "Cote d'Ivoire",
                'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czechia', 'Denmark',
                'Djibouti', 'Dominica', 'Dominican Republic', 'Burundi', 'Bulgaria',
                'Brunei Darussalam', 'Albania', 'Algeria', 'American Samoa',
                'Andorra', 'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia',
                'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas, The',
                'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize',
                'Benin', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina',
                'Botswana', 'Brazil', 'British Virgin Islands', 'Ecuador', 'Guyana',
                'Ireland', 'Honduras', 'Hong Kong SAR, China', 'Hungary', 'Iceland',
                'India', 'Indonesia', 'Iran, Islamic Rep.', 'Iraq', 'Isle of Man',
                'Haiti', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan',
                'Kenya', 'Kiribati', "Korea, Dem. People's Rep.", 'Korea, Rep.',
                'Kosovo', 'Kuwait', 'Kyrgyz Republic', 'Lao PDR', 'Zimbabwe',
                'Egypt, Arab Rep.', 'Guinea-Bissau', 'El Salvador',
                'Equatorial Guinea', 'Eritrea', 'Estonia', 'Eswatini', 'Ethiopia',
                'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Polynesia',
                'Gabon', 'Gambia, The', 'Georgia', 'Germany', 'Ghana', 'Gibraltar',
                'Greece', 'Greenland', 'Grenada', 'Guam', 'Guatemala', 'Guinea']

RETRIEVE COUNTRIES WITH AVAILABLE DATA ON ALL FEATURES

In [5]:
# EMISSION
df_emission = df[(df['Series Name'] == 'CO2 emissions (kt)') &
                 (df['2019 [YR2019]'] != '..') &
                 (df['2020 [YR2020]'] != '..')] \
    .set_index('Country Name')
emission_countries = set(df_emission.index).intersection(set(country_list))

# GDP
df_gdp = df[(df['Series Name'] == 'GDP (constant 2015 US$)') &
            (df['2019 [YR2019]'] != '..') &
            (df['2020 [YR2020]'] != '..')] \
    .set_index('Country Name')
gdp_countries = set(df_gdp.index).intersection(set(country_list))

# INTERSECTION OF AVAILABLE DATA
country_intersect = sorted(
    list(emission_countries.intersection(gdp_countries)))

CREATE DATA SUBSETS

In [6]:
# Emission Datasets
df_emission = df_emission.loc[country_intersect]
df_emission_2019 = df_emission['2019 [YR2019]'].astype('float')
df_emission_2020 = df_emission['2020 [YR2020]'].astype('float')

# GDP Datasets
df_gdp = df_gdp.loc[country_intersect]
df_gdp_2019 = df_gdp['2019 [YR2019]'].astype(float)
df_gdp_2020 = df_gdp['2020 [YR2020]'].astype(float)

# Population Datasets
df_pop = df[(df['Series Name'] == 'Population, total') &
            (df['2019 [YR2019]'] != '..')] \
    .set_index('Country Name')
df_pop = df_pop.loc[list(country_intersect)]
df_pop_2019 = df_pop['2019 [YR2019]'].astype('float')
df_pop_2020 = df_pop['2020 [YR2020]'].astype('float')

# Emission per Capita Dataset
df_epc_2019 = pd.DataFrame(
    {'CO2 per capita': (df_emission_2019/df_pop_2019*1000).to_list(),
     'Income Bracket': [mapper[country] for country in df_emission_2019.index]},
    index=df_emission_2019.index)
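
The factor of 1000 in the per-capita computation converts kilotonnes to tonnes. As a worked check, here is the same arithmetic for a single country, using the Afghanistan 2019 figures that appear in the World Bank sample later in this notebook. The variable names are illustrative only.

```python
co2_kt = 11238.83        # CO2 emissions (kt), Afghanistan, 2019
population = 37769499    # Population, total, Afghanistan, 2019

# 1 kt = 1000 t, so multiplying by 1000 expresses the ratio in tonnes per person
co2_tonnes_per_capita = co2_kt / population * 1000
print(f"{co2_tonnes_per_capita:.4f} t CO2 per person")
```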

1. Load the energy dataset


In [7]:
df_energy = pd.read_csv('renewable.csv', usecols=lambda x: x not in ['Code'])
df_energy.head(3)
Out[7]:
Entity Year Geo Biomass Other - TWh Solar Generation - TWh Wind Generation - TWh Hydro Generation - TWh
0 Africa 1971 0.164 0.0 0.0 26.013390
1 Africa 1972 0.165 0.0 0.0 29.633196
2 Africa 1973 0.170 0.0 0.0 31.345707
Table 4. Sample Data of Renewable CSV

2. Get the countries of interest and 2019 and 2020 values


In [8]:
wanted_countries = ['Algeria', 'Argentina', 'Australia', 'Austria', 
                    'Azerbaijan', 'Bangladesh', 'Belarus', 'Belgium',
                    'Brazil', 'Bulgaria', 'Canada', 'Chile', 'China', 
                    'Colombia', 'Croatia', 'Cyprus', 'Czechia', 'Denmark',
                    'Ecuador', 'Egypt', 'Estonia', 'Finland', 'France', 
                    'Germany', 'Greece', 'Hong Kong', 'Hungary', 'Iceland',
                    'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel',
                    'Italy', 'Japan', 'Kazakhstan', 'Kuwait', 'Latvia',
                    'Lithuania', 'Luxembourg', 'Malaysia', 'Mexico', 'Morocco',
                    'Netherlands', 'New Zealand', 'North Macedonia', 'Norway',
                    'Oman', 'Pakistan', 'Peru', 'Philippines', 
                    'Poland', 'Portugal', 'Qatar', 'Romania',  'Russia', 
                    'Saudi Arabia', 'Singapore', 'Slovakia', 'Slovenia', 
                    'South Africa', 'South Korea', 'Spain', 'Sri Lanka', 
                    'Sweden', 'Switzerland', 'Taiwan', 'Thailand', 
                    'Trinidad and Tobago', 'Turkey', 'Turkmenistan', 'Ukraine',
                    'United Arab Emirates', 'United Kingdom', 'United States', 
                    'Uzbekistan', 'Venezuela', 'Vietnam']

# Only get the countries of interest
df_energy = df_energy[df_energy['Entity'].isin(wanted_countries)]

2019 Values

In [9]:
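# Get 2019 values only (copy so the new column is not added to a slice of df_energy)
df_energy_2019 = df_energy[df_energy['Year'] == 2019].copy()

# Get the total renewable energy consumption.
# NOTE: 'Code' is dropped at load time, so positions 3:7 cover the Solar,
# Wind, and Hydro columns; 'Geo Biomass Other - TWh' (position 2) is excluded.
df_energy_2019['Total Renewable Energy Consumption'] = (df_energy_2019
                                                        .iloc[:, 3:7]
                                                        .sum(axis=1))
df_energy_2019 = df_energy_2019.loc[:, ['Entity', 'Total Renewable Energy Consumption']]
df_energy_2019.head(3)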
Out[9]:
Entity Total Renewable Energy Consumption
168 Algeria 0.777000
225 Argentina 33.300167
396 Australia 51.789451
Table 5. Energy Consumption for 2019

2020 Values

In [10]:
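# Get 2020 values only (copy so the new column is not added to a slice of df_energy)
df_energy_2020 = df_energy[df_energy['Year'] == 2020].copy()

# Get the total renewable energy consumption.
# NOTE: 'Code' is dropped at load time, so positions 3:7 cover the Solar,
# Wind, and Hydro columns; 'Geo Biomass Other - TWh' (position 2) is excluded.
df_energy_2020['Total Renewable Energy Consumption'] = (df_energy_2020
                                                        .iloc[:, 3:7]
                                                        .sum(axis=1))
df_energy_2020 = df_energy_2020.loc[:, ['Entity', 'Total Renewable Energy Consumption']]
df_energy_2020.head(3)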
Out[10]:
Entity Total Renewable Energy Consumption
169 Algeria 0.742300
226 Argentina 34.424309
397 Australia 60.870478
Table 6. Energy Consumption for 2020
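
Because `Code` is excluded when renewable.csv is loaded, the positional slice `.iloc[:, 3:7]` starts at the solar column. A name-based selection makes explicit which sources enter the total. It is sketched here on the first row of Table 4, with `sample`, `source_cols`, `by_name`, and `by_position` as hypothetical names. Whether "Geo Biomass Other" belongs in the total is an analytical choice that this sketch does not settle.

```python
import pandas as pd

# First row of Table 4 (Africa, 1971), with the 'Code' column already dropped
sample = pd.DataFrame({
    "Entity": ["Africa"],
    "Year": [1971],
    "Geo Biomass Other - TWh": [0.164],
    "Solar Generation - TWh": [0.0],
    "Wind Generation - TWh": [0.0],
    "Hydro Generation - TWh": [26.013390],
})

# Select every generation column by its name suffix instead of by position
source_cols = [c for c in sample.columns if c.endswith("- TWh")]
by_name = sample[source_cols].sum(axis=1)

# The positional slice used in the cells above, for comparison
by_position = sample.iloc[:, 3:7].sum(axis=1)

print(source_cols)
print(by_name.iloc[0], by_position.iloc[0])
```

Selecting by name keeps the total stable if columns are later added, dropped, or reordered at load time.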

3. World Bank Data


In [11]:
df_wb = df.loc[:, ~df.columns.isin(['Series Code', 'Country Code'])]
df_wb.head(3)
Out[11]:
Country Name Series Name 2019 [YR2019] 2020 [YR2020]
0 Afghanistan CO2 emissions (kt) 11238.83 8709.47
1 Afghanistan GDP (constant 2015 US$) 22071985906.2168 21553051296.9328
2 Afghanistan Population, total 37769499 38972230
Table 7. Sample of World Bank Data

Get the countries of interest and 2019 and 2020 values

In [12]:
# wanted_countries was already defined when filtering the energy dataset;
# the same list is reused here.

# Only get the countries of interest
df_wb = df_wb[df_wb['Country Name'].isin(wanted_countries)]
df_wb.head(3)
Out[12]:
Country Name Series Name 2019 [YR2019] 2020 [YR2020]
8 Algeria CO2 emissions (kt) 170582.4 161563
9 Algeria GDP (constant 2015 US$) 177355540257.925 168310407704.439
10 Algeria Population, total 42705368 43451666
Table 8. World Bank Data for year 2019 and 2020
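
One caveat about this filter is that `wanted_countries` uses the short names found in the energy dataset, whereas `country_list` shows World Bank spellings such as 'Egypt, Arab Rep.' and 'Viet Nam'. If Data.csv uses those spellings, `isin(wanted_countries)` silently drops the affected countries. The sketch below shows one way to harmonize names before filtering. `WB_TO_SHORT` is a hypothetical mapping assembled from the two lists in this notebook, and `wb_names` and `wanted` are small stand-ins for the real columns.

```python
import pandas as pd

# Hypothetical mapping: World Bank spelling -> short name used in wanted_countries
WB_TO_SHORT = {
    "Egypt, Arab Rep.": "Egypt",
    "Hong Kong SAR, China": "Hong Kong",
    "Iran, Islamic Rep.": "Iran",
    "Korea, Rep.": "South Korea",
    "Russian Federation": "Russia",
    "Slovak Republic": "Slovakia",
    "Turkiye": "Turkey",
    "Venezuela, RB": "Venezuela",
    "Viet Nam": "Vietnam",
}

# Stand-ins for df_wb['Country Name'] and wanted_countries
wb_names = pd.Series(["Algeria", "Egypt, Arab Rep.", "Korea, Rep.", "Viet Nam"],
                     name="Country Name")
wanted = ["Algeria", "Egypt", "South Korea", "Vietnam"]

# Without harmonization, only exact spellings match
matched_raw = int(wb_names.isin(wanted).sum())

# With harmonization, every country in this small example matches
harmonized = wb_names.replace(WB_TO_SHORT)
matched_harmonized = int(harmonized.isin(wanted).sum())

print(matched_raw, matched_harmonized)
```

'Taiwan' appears in `wanted_countries` but has no counterpart in `country_list`, so no mapping entry is proposed for it.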

World Bank Data --- Merged

In [13]:
# Split by series, drop the redundant 'Series Name' column, and rename the
# year columns. Method chaining returns new DataFrames, so no slice of
# df_wb is modified in place (which would otherwise trigger
# SettingWithCopyWarning).
# CO2
df_wb_co2 = (df_wb[df_wb['Series Name'] == 'CO2 emissions (kt)']
             .drop(columns='Series Name')
             .rename(columns={'2019 [YR2019]': 'CO2 Emissions (2019)',
                              '2020 [YR2020]': 'CO2 Emissions (2020)'}))

# GDP
df_wb_gdp = (df_wb[df_wb['Series Name'] == 'GDP (constant 2015 US$)']
             .drop(columns='Series Name')
             .rename(columns={'2019 [YR2019]': 'GDP (2019)',
                              '2020 [YR2020]': 'GDP (2020)'}))

# Population
df_wb_pop = (df_wb[df_wb['Series Name'] == 'Population, total']
             .drop(columns='Series Name')
             .rename(columns={'2019 [YR2019]': 'Population (2019)',
                              '2020 [YR2020]': 'Population (2020)'}))

df_wb_merged = pd.merge(df_wb_co2, df_wb_gdp, on='Country Name', how='inner')
df_wb_merged = pd.merge(df_wb_merged, df_wb_pop, on='Country Name', 
                        how='inner')
df_wb_merged.head(3)
Out[13]:
Country Name CO2 Emissions (2019) CO2 Emissions (2020) GDP (2019) GDP (2020) Population (2019) Population (2020)
0 Algeria 170582.4 161563 177355540257.925 168310407704.439 42705368 43451666
1 Argentina 168162 154535.9 571450737224.442 514630046744.607 44938712 45376763
2 Australia 395199.1 378996.8 1491740073728.97 1490980996778.96 25340217 25655289
Table 9. Merged World Bank Data for year 2019 and 2020

World Bank Data --- 2020

In [14]:
df_wb_2019 = df_wb_merged[['Country Name', 'CO2 Emissions (2019)', 'GDP (2019)', 'Population (2019)']]
df_wb_2020 = df_wb_merged[['Country Name', 'CO2 Emissions (2020)', 'GDP (2020)', 'Population (2020)']]
df_wb_2020.head(3)
Out[14]:
Country Name CO2 Emissions (2020) GDP (2020) Population (2020)
0 Algeria 161563 168310407704.439 43451666
1 Argentina 154535.9 514630046744.607 45376763
2 Australia 378996.8 1490980996778.96 25655289
Table 10. Merged World Bank Data for year 2020

4. Merge Renewable dataset and WB dataset


In [15]:
# For 2019: the outer merge followed by dropna keeps only rows that have
# values in both datasets
df_all_2019 = pd.merge(df_wb_2019, df_energy_2019, left_on='Country Name',
                       right_on='Entity',
                       how='outer').drop(['Country Name', 'Entity'], axis=1)
df_all_2019.dropna(axis=0, how='any', inplace=True)
df_all_2019 = df_all_2019.astype(float)

# For 2020
df_all_2020 = pd.merge(df_wb_2020, df_energy_2020, left_on='Country Name',
                       right_on='Entity',
                       how='outer').drop(['Country Name', 'Entity'], axis=1)
df_all_2020.dropna(axis=0, how='any', inplace=True)
df_all_2020 = df_all_2020.astype(float)
df_all_2020.head(3)
Out[15]:
CO2 Emissions (2020) GDP (2020) Population (2020) Total Renewable Energy Consumption
0 161563.0 1.683104e+11 43451666.0 0.742300
1 154535.9 5.146300e+11 45376763.0 34.424309
2 378996.8 1.490981e+12 25655289.0 60.870478
Table 11. Merged World Bank Data and Renewable Energy Consumption for 2020
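
An outer merge followed by `dropna` discards unmatched countries without reporting them. One way to see which rows are lost is the `indicator=True` option of `pd.merge`, sketched below on toy frames. The two renewable totals for Algeria and Argentina are taken from Table 6. The GDP values, the Viet Nam/Vietnam rows, and the names `wb`, `energy`, `audit`, `kept`, and `unmatched` are placeholders for illustration.

```python
import pandas as pd

# Toy stand-ins for df_wb_2020 and df_energy_2020 (GDP values are placeholders)
wb = pd.DataFrame({
    "Country Name": ["Algeria", "Argentina", "Viet Nam"],
    "GDP (2020)": [1.0, 2.0, 3.0],
})
energy = pd.DataFrame({
    "Entity": ["Algeria", "Argentina", "Vietnam"],
    "Total Renewable Energy Consumption": [0.742300, 34.424309, 5.0],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
audit = pd.merge(wb, energy, left_on="Country Name", right_on="Entity",
                 how="outer", indicator=True)

kept = audit[audit["_merge"] == "both"]
unmatched = audit[audit["_merge"] != "both"]

print(unmatched[["Country Name", "Entity", "_merge"]])
```

Inspecting `unmatched` before calling `dropna` shows whether a country was lost to a spelling difference or to genuinely missing data.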

5. Data Loading and Pre-processing for the Isolated Regression Models


In [16]:
# For isolated regression later (2019)
df_all_2019_2 = pd.merge(
    df_wb_2019, df_energy_2019, left_on="Country Name", right_on="Entity", how="outer"
).drop(["Entity"], axis=1)

df_all_2019_2.dropna(axis=0, how="any", inplace=True)
df_all_2019_2.iloc[:, 1:] = df_all_2019_2.iloc[:, 1:].astype(float)

# For isolated regression later (2020)
df_all_2020_2 = pd.merge(
    df_wb_2020, df_energy_2020, left_on="Country Name", right_on="Entity", how="outer"
).drop(["Entity"], axis=1)

df_all_2020_2.dropna(axis=0, how="any", inplace=True)
df_all_2020_2.iloc[:, 1:] = df_all_2020_2.iloc[:, 1:].astype(float)

# What is the income bracket of each country?
mapper = pd.read_csv("mapper_df.csv")

# We only consider countries based on our wanted_countries variable
all_countries_labels = mapper[mapper["Country Name"].isin(wanted_countries)]
all_countries_labels.head(3)

# Include the Income Bracket column of the mapper to both df_all_2019_2 and df_all_2020_2
df_all_2019_2 = pd.merge(
    df_all_2019_2, all_countries_labels, on="Country Name")
df_all_2020_2 = pd.merge(
    df_all_2020_2, all_countries_labels, on="Country Name")

# Developed (H) and developing (L, LM, UM) countries - 2019
developed_2019 = (
    df_all_2019_2[df_all_2019_2["Income Bracket"] == "H"]
    .reset_index(drop=True)
    .drop(["Country Name", "Income Bracket"], axis=1)
)
developed_2019 = developed_2019.astype(float)
developing_2019 = (
    df_all_2019_2[df_all_2019_2["Income Bracket"].isin(["L", "LM", "UM"])]
    .reset_index(drop=True)
    .drop(["Country Name", "Income Bracket"], axis=1)
)
developing_2019 = developing_2019.astype(float)

# Developed (H) and developing (L, LM, UM) countries - 2020
developed_2020 = (
    df_all_2020_2[df_all_2020_2["Income Bracket"] == "H"]
    .reset_index(drop=True)
    .drop(["Country Name", "Income Bracket"], axis=1)
)
developed_2020 = developed_2020.astype(float)
developing_2020 = (
    df_all_2020_2[df_all_2020_2["Income Bracket"].isin(["L", "LM", "UM"])]
    .reset_index(drop=True)
    .drop(["Country Name", "Income Bracket"], axis=1)
)
developing_2020 = developing_2020.astype(float)
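
The four developed/developing blocks above repeat the same steps, so they could be folded into a small helper. The sketch below is an optional refactoring, not the notebook's own code. `split_by_bracket` and the `demo` frame are hypothetical names, and the helper assumes the frame has "Country Name" and "Income Bracket" columns alongside numeric ones.

```python
import pandas as pd


def split_by_bracket(df, brackets):
    """Return the numeric columns of the rows whose Income Bracket is in `brackets`."""
    subset = df[df["Income Bracket"].isin(brackets)]
    return (subset
            .drop(columns=["Country Name", "Income Bracket"])
            .reset_index(drop=True)
            .astype(float))


# Toy frame with the same kind of columns as df_all_2019_2
demo = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "CO2 Emissions (2019)": ["1.0", "2.0", "3.0"],
    "Income Bracket": ["H", "UM", "L"],
})

developed = split_by_bracket(demo, ["H"])
developing = split_by_bracket(demo, ["L", "LM", "UM"])
print(len(developed), len(developing))
```

With such a helper, each of the four subsets becomes a single call, for example `split_by_bracket(df_all_2019_2, ["H"])`.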

EXPLORATORY DATA ANALYSIS

In [17]:
# Load datasets
data_path = 'Data.csv'
metadata_path = 'SeriesMetadata.csv'

data = pd.read_csv(data_path, encoding='ascii', nrows=1064)
metadata = pd.read_csv(metadata_path, encoding='ISO-8859-1', nrows=1064)

SQLite

In [18]:
conn = sqlite3.connect("ACS-LT6.db")

# Update column names to remove spaces
data_updated = data.copy()

data_updated["2019 [YR2019]"] = pd.to_numeric(
    data_updated["2019 [YR2019]"].replace("..", 0), errors="coerce"
).astype("float64")
data_updated["2020 [YR2020]"] = pd.to_numeric(
    data_updated["2020 [YR2020]"].replace("..", 0), errors="coerce"
).astype("float64")

data_updated.columns = data_updated.columns.str.replace(" ", "")

# Extracting updated data for each table
country_data_updated = data_updated[[
    "CountryName", "CountryCode"]].drop_duplicates()
series_data_updated = data_updated[[
    "SeriesName", "SeriesCode"]].drop_duplicates()

# Rename '2019[YR2019]' to 'DataValue' for the 2019 table
# (rename returns a new DataFrame, so no slice is modified in place)
year_2019_data_value = data_updated[[
    "CountryCode", "SeriesCode", "2019[YR2019]"]].rename(
    columns={"2019[YR2019]": "DataValue"})

# Rename '2020[YR2020]' to 'DataValue' for the 2020 table
year_2020_data_value = data_updated[[
    "CountryCode", "SeriesCode", "2020[YR2020]"]].rename(
    columns={"2020[YR2020]": "DataValue"})

# Write the updated data to SQL tables
country_data_updated.to_sql(
    "countries", conn, if_exists="replace", index=False)
series_data_updated.to_sql("series", conn, if_exists="replace", index=False)
year_2019_data_value.to_sql("Y2019", conn, if_exists="replace", index=False)
year_2020_data_value.to_sql("Y2020", conn, if_exists="replace", index=False)

# Close the connection
conn.close()
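
The cell above stores missing entries (`..`) as 0 before writing to SQLite, which means the zeros enter every summary statistic computed from these tables. The sketch below contrasts that choice with keeping them as NaN. It uses two CO2 values that appear in this notebook plus two assumed missing entries, and the names `raw`, `as_zero`, and `as_nan` are illustrative only.

```python
import pandas as pd

# Two reported values and two missing entries, as strings like in Data.csv
raw = pd.Series(["11238.83", "..", "4993.3", ".."], dtype=object)

# Approach used above: missing values become 0
as_zero = pd.to_numeric(raw.replace("..", 0), errors="coerce")

# Alternative: let to_numeric coerce ".." to NaN, which describe() and mean() skip
as_nan = pd.to_numeric(raw, errors="coerce")

print("mean with zeros:", as_zero.mean())
print("mean with NaN  :", as_nan.mean())
print("missing count  :", int(as_nan.isna().sum()))
```

In this small example, coding missing values as 0 halves the mean. The same mechanism is consistent with the `min` and `25%` rows of the descriptive statistics for Y2019 and Y2020 being exactly zero.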
In [19]:
# Connect to the SQLite database
conn = sqlite3.connect('ACS-LT6.db')

# Retrieve the list of tables in the database
tables_query = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql_query(tables_query, conn)
table_names = tables['name'].tolist()

# Function to load data from a table and perform EDA
def perform_eda(table_name):
    # Load data into a DataFrame
    query = f"SELECT * FROM {table_name}"
    table_df = pd.read_sql_query(query, conn)

    # Basic Data Overview
    print(f"Data Overview for {table_name}:")
    print(table_df.info())
    print(table_df.head())

    # Descriptive Statistics
    print(f"\nDescriptive Statistics for {table_name}:")
    print(table_df.describe())
    return table_df

def display_histogram(table_name, data):
    # Visualization: Histograms for numeric data
    numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns
    for col in numeric_cols:
        plt.figure(figsize=(8, 6))
        sns.histplot(data[col], kde=True)
        plt.title(f'Distribution of {col} in {table_name}')
        plt.xlabel(col)
        plt.ylabel('Frequency')
        plt.show()

def dsiplay_boxplots(table_name, data):
    # Visualization: Boxplots for numeric data
    numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns
    for col in numeric_cols:
        plt.figure(figsize=(8, 6))
        sns.boxplot(data[col])
        plt.title(f'Boxplot of {col} in {table_name}')
        plt.xlabel(col)
        plt.show()
        
def dsiplay_corr(table_name, data):
    # Visualization: correlation heatmap for numeric data
    numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns
    # Correlation Matrix for numeric data
    if len(numeric_cols) > 1:
        plt.figure(figsize=(10, 8))
        sns.heatmap(data[numeric_cols].corr(), annot=True, cmap='coolwarm')
        plt.title(f'Correlation Matrix for {table_name}')
        plt.show()

# Perform EDA for each table in the database
for table in table_names:
    table_data = perform_eda(table)
    #display_histogram(table, table_data)
    
    
# Close the connection
conn.close()
Data Overview for countries:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   CountryName  266 non-null    object
 1   CountryCode  266 non-null    object
dtypes: object(2)
memory usage: 4.3+ KB
None
      CountryName CountryCode
0     Afghanistan         AFG
1         Albania         ALB
2         Algeria         DZA
3  American Samoa         ASM
4         Andorra         AND

Descriptive Statistics for countries:
        CountryName CountryCode
count           266         266
unique          266         266
top     Afghanistan         AFG
freq              1           1
Data Overview for series:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   SeriesName  4 non-null      object
 1   SeriesCode  4 non-null      object
dtypes: object(2)
memory usage: 192.0+ bytes
None
                 SeriesName         SeriesCode
0        CO2 emissions (kt)     EN.ATM.CO2E.KT
1   GDP (constant 2015 US$)     NY.GDP.MKTP.KD
2         Population, total        SP.POP.TOTL
3  Urban land area (sq. km)  AG.LND.TOTL.UR.K2

Descriptive Statistics for series:
                SeriesName      SeriesCode
count                    4               4
unique                   4               4
top     CO2 emissions (kt)  EN.ATM.CO2E.KT
freq                     1               1
Data Overview for Y2019:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1064 entries, 0 to 1063
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   CountryCode  1064 non-null   object 
 1   SeriesCode   1064 non-null   object 
 2   DataValue    1064 non-null   float64
dtypes: float64(1), object(2)
memory usage: 25.1+ KB
None
  CountryCode         SeriesCode     DataValue
0         AFG     EN.ATM.CO2E.KT  1.123883e+04
1         AFG     NY.GDP.MKTP.KD  2.207199e+10
2         AFG        SP.POP.TOTL  3.776950e+07
3         AFG  AG.LND.TOTL.UR.K2  0.000000e+00
4         ALB     EN.ATM.CO2E.KT  4.993300e+03

Descriptive Statistics for Y2019:
          DataValue
count  1.064000e+03
mean   6.639437e+11
std    4.685527e+12
min    0.000000e+00
25%    0.000000e+00
50%    3.921470e+05
75%    9.775513e+08
max    8.472015e+13
Data Overview for Y2020:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1064 entries, 0 to 1063
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   CountryCode  1064 non-null   object 
 1   SeriesCode   1064 non-null   object 
 2   DataValue    1064 non-null   float64
dtypes: float64(1), object(2)
memory usage: 25.1+ KB
None
  CountryCode         SeriesCode     DataValue
0         AFG     EN.ATM.CO2E.KT  8.709470e+03
1         AFG     NY.GDP.MKTP.KD  2.155305e+10
2         AFG        SP.POP.TOTL  3.897223e+07
3         AFG  AG.LND.TOTL.UR.K2  0.000000e+00
4         ALB     EN.ATM.CO2E.KT  4.383200e+03

Descriptive Statistics for Y2020:
          DataValue
count  1.064000e+03
mean   6.457085e+11
std    4.550461e+12
min    0.000000e+00
25%    0.000000e+00
50%    3.810641e+05
75%    9.116047e+08
max    8.211736e+13
In [20]:
conn = sqlite3.connect('ACS-LT6.db')

# Query to extract GDP and CO2 emissions data for 2019
query_2019 = """
SELECT c.CountryCode, c.CountryName, s.SeriesName, s.SeriesCode, y.DataValue as DataValue2019
FROM Y2019 y
JOIN countries c ON y.CountryCode = c.CountryCode
JOIN series s ON y.SeriesCode = s.SeriesCode
WHERE s.SeriesName = 'GDP (constant 2015 US$)' OR s.SeriesName = 'CO2 emissions (kt)';
"""
data_2019 = pd.read_sql_query(query_2019, conn)

# Query to extract GDP and CO2 emissions data for 2020
query_2020 = """
SELECT c.CountryCode, c.CountryName, s.SeriesName, s.SeriesCode, y.DataValue as DataValue2020
FROM Y2020 y
JOIN countries c ON y.CountryCode = c.CountryCode
JOIN series s ON y.SeriesCode = s.SeriesCode
WHERE s.SeriesName = 'GDP (constant 2015 US$)' OR s.SeriesName = 'CO2 emissions (kt)';
"""
data_2020 = pd.read_sql_query(query_2020, conn)

conn.close()

# Display the extracted data for 2019 and 2020
# (display() is needed because only a cell's last expression is shown)
display(data_2019.head(), data_2020.head())
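
The 2019 and 2020 queries differ only in the year table and the alias of the value column, so they could be generated by one helper. The following is a minimal sketch, not the notebook's actual code: the function name `fetch_gdp_co2` and the whitelist `YEAR_TABLES` are hypothetical, the `ORDER BY` clause is an addition for deterministic output, and the in-memory database only mimics the schema shown in the EDA output.

```python
import sqlite3

# Hypothetical whitelist: table names cannot be bound as SQL parameters,
# so the year is validated before being formatted into the query.
YEAR_TABLES = {2019: "Y2019", 2020: "Y2020"}
SERIES = ("GDP (constant 2015 US$)", "CO2 emissions (kt)")


def fetch_gdp_co2(conn, year):
    """Return (CountryCode, CountryName, SeriesName, SeriesCode, value) rows."""
    table = YEAR_TABLES[year]  # raises KeyError for unsupported years
    query = f"""
        SELECT c.CountryCode, c.CountryName, s.SeriesName, s.SeriesCode,
               y.DataValue AS DataValue{year}
        FROM {table} y
        JOIN countries c ON y.CountryCode = c.CountryCode
        JOIN series s ON y.SeriesCode = s.SeriesCode
        WHERE s.SeriesName IN (?, ?)
        ORDER BY c.CountryCode, s.SeriesCode;
    """
    return conn.execute(query, SERIES).fetchall()


# Tiny in-memory stand-in for ACS-LT6.db (Afghanistan's 2019 rows only)
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE countries (CountryName TEXT, CountryCode TEXT);
    CREATE TABLE series (SeriesName TEXT, SeriesCode TEXT);
    CREATE TABLE Y2019 (CountryCode TEXT, SeriesCode TEXT, DataValue REAL);
    INSERT INTO countries VALUES ('Afghanistan', 'AFG');
    INSERT INTO series VALUES ('CO2 emissions (kt)', 'EN.ATM.CO2E.KT'),
                              ('GDP (constant 2015 US$)', 'NY.GDP.MKTP.KD'),
                              ('Population, total', 'SP.POP.TOTL');
    INSERT INTO Y2019 VALUES ('AFG', 'EN.ATM.CO2E.KT', 11238.83),
                             ('AFG', 'NY.GDP.MKTP.KD', 2.207199e10),
                             ('AFG', 'SP.POP.TOTL', 3.77695e7);
""")
rows_2019 = fetch_gdp_co2(demo, 2019)
demo.close()
```

In the notebook, a parameterized query string like this one could also be passed to `pd.read_sql_query` through its `params` argument to get a DataFrame directly.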
In [22]:
# Preparing data for OLS regression: GDP as the dependent variable and CO2 emissions as the independent variable

# Filtering and merging data for GDP and CO2 emissions
gdp_data_2019 = data_2019[data_2019['SeriesName'] == 'GDP (constant 2015 US$)']
co2_data_2019 = data_2019[data_2019['SeriesName'] == 'CO2 emissions (kt)']
merged_data_2019 = pd.merge(gdp_data_2019, co2_data_2019, on='CountryCode', suffixes=('_GDP', '_CO2'))

# Filtering and merging data for 2020
gdp_data_2020 = data_2020[data_2020['SeriesName'] == 'GDP (constant 2015 US$)']
co2_data_2020 = data_2020[data_2020['SeriesName'] == 'CO2 emissions (kt)']
merged_data_2020 = pd.merge(gdp_data_2020, co2_data_2020, on='CountryCode', suffixes=('_GDP', '_CO2'))

# Performing OLS for 2019
Y_2019 = merged_data_2019['DataValue2019_GDP']
X_2019 = merged_data_2019['DataValue2019_CO2']
X_2019 = sm.add_constant(X_2019)  # Adds a constant term to the predictor

# Fit the OLS model for 2019
model_2019 = sm.OLS(Y_2019, X_2019).fit()

# Performing OLS for 2020
Y_2020 = merged_data_2020['DataValue2020_GDP']
X_2020 = merged_data_2020['DataValue2020_CO2']
X_2020 = sm.add_constant(X_2020)  # Adds a constant term to the predictor

# Fit the OLS model for 2020
model_2020 = sm.OLS(Y_2020, X_2020).fit()
In [23]:
plt.figure(figsize=(10, 6))
plt.hist(merged_data_2019['DataValue2019_GDP'], bins=20, color='red', alpha=0.7)
plt.title('Histogram of GDP for 2019')
plt.xlabel('GDP (constant 2015 US$)')
plt.ylabel('Frequency')
plt.show()
Figure 1. Histogram of GDP for year 2019
This visualization depicts a histogram of GDP (Gross Domestic Product) for the year 2019. The x-axis represents GDP in constant 2015 US dollars, while the y-axis shows the frequency of countries falling within each GDP range. The histogram provides insights into the distribution of economic outputs among the countries in the dataset. There are many countries in the low GDP range.
In [24]:
plt.figure(figsize=(10, 6))
plt.hist(merged_data_2019['DataValue2019_CO2'], bins=20, color='blue', alpha=0.7)
plt.title('Histogram of CO2 Emissions for 2019')
plt.xlabel('CO2 Emissions (kt)')
plt.ylabel('Frequency')
plt.show()
Figure 2. Histogram of CO2 Emissions for year 2019
This graph illustrates a histogram of CO2 emissions for the year 2019. The x-axis displays CO2 emissions in kilotons, and the y-axis represents the frequency of countries falling within specific emission levels. The histogram offers a visual representation of the distribution of carbon dioxide emissions across the included countries in 2019. Similar to GDP, there are also many countries in the low CO2 Emissions range.
In [25]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=[merged_data_2019['DataValue2019_CO2'], merged_data_2020['DataValue2020_CO2']], palette=['blue', 'green'])
plt.title('Box Plot of CO2 Emissions for 2019 and 2020')
plt.xticks([0, 1], ['2019', '2020'])
plt.ylabel('CO2 Emissions (kt)')
plt.show()
Figure 3. Boxplot for both 2019 and 2020 CO2 Emissions
This figure is a box plot comparing CO2 emissions for 2019 and 2020. The y-axis shows CO2 emissions in kilotons, and each box summarizes the distribution, central tendency, and variability of emissions for that year, which allows a visual check for notable changes between the two years. Because most countries sit in the low-emission range, the boxes are compressed near zero, and countries above roughly the 0.1 tick on the y-axis are drawn as outliers. That tick appears to be in scaled axis units rather than raw kilotons, since even a small emitter such as Albania reports several thousand kt.
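
The outliers in this box plot follow from the usual box-plot convention, which seaborn applies by default: points farther than 1.5 times the interquartile range from the box are drawn individually. Below is a minimal sketch of that rule using only the standard library and made-up numbers rather than the study data; the function name `iqr_outliers` is hypothetical.

```python
import statistics


def iqr_outliers(values, whis=1.5):
    """Return the points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return [v for v in values if v < lower or v > upper]


# Illustrative, right-skewed toy data (not from the dataset)
toy_emissions = [1, 2, 3, 4, 100]
flagged = iqr_outliers(toy_emissions)
```

With a heavily right-skewed variable, the third quartile stays small, so the upper fence is low and every large emitter lands beyond it.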
In [26]:
plt.figure(figsize=(10, 6))
plt.hist(merged_data_2020['DataValue2020_CO2'], bins=20, color='green', alpha=0.7)
plt.title('Histogram of CO2 Emissions for 2020')
plt.xlabel('CO2 Emissions (kt)')
plt.ylabel('Frequency')
plt.show()
Figure 4. Histogram of CO2 Emissions for year 2020
This graph illustrates a histogram of CO2 emissions for the year 2020. The x-axis displays CO2 emissions in kilotons, and the y-axis represents the frequency of countries falling within specific emission levels. The histogram offers a visual representation of the distribution of carbon dioxide emissions across the included countries in 2020. Similar to 2019, there are also many countries in the low CO2 Emissions range.
In [27]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=[merged_data_2019['DataValue2019_GDP'], merged_data_2020['DataValue2020_GDP']], palette=['red', 'purple'])
plt.title('Box Plot of GDP for 2019 and 2020')
plt.xticks([0, 1], ['2019', '2020'])
plt.ylabel('GDP (constant 2015 US$)')
plt.show()
Figure 5. Boxplot for both 2019 and 2020 GDP
This figure is a box plot comparing GDP for 2019 and 2020. The y-axis shows GDP in constant 2015 US dollars, and each box summarizes the distribution, central tendency, and variability of GDP for that year, which allows a visual check for notable changes between the two years. As expected from the large number of countries in the low GDP range, countries above roughly the 0.1 tick on the y-axis are drawn as outliers. That tick appears to be in scaled axis units rather than raw US dollars.

STATISTICAL ANALYSIS

"EARTH IS HEALING"

During the pandemic, a popular narrative emerged in news outlets and social media: "Earth is Healing". To test that narrative in terms of CO2 emissions, we can conduct a paired-observations t-test on the data for 2019 and 2020.

One-Tailed t-test on Paired Observations
$\alpha = 0.05$

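NULL HYPOTHESIS
$\mu_{D} = d_{0}$
The mean paired difference in CO2 emissions (2019 minus 2020), $\mu_{D}$, equals $d_{0} = 0$. There is no significant difference in CO2 emissions between 2019 and 2020.

ALTERNATIVE HYPOTHESIS
$\mu_{D} > d_{0}$
The mean paired difference is greater than zero. There is a significant decrease in CO2 emissions between 2019 and 2020.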

In [28]:
result = stats.ttest_rel(df_emission_2019,
                         df_emission_2020,
                         alternative='greater')
In [29]:
print(f'The paired observation t-score is: {result.statistic:.4f}')
print(f'The p-value is: {result.pvalue:.4f}')
The paired observation t-score is: 2.4731
The p-value is: 0.0072

We reject the null hypothesis that there is no difference in CO2 emissions between 2019 and 2020 (p = 0.0072, α = 0.05). Therefore, there is sufficient evidence to suggest that CO2 emissions decreased significantly from 2019 to 2020, during the pandemic.
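
For reference, the paired t statistic is the mean of the per-country differences divided by its standard error, $t = \bar{d} / (s_{d} / \sqrt{n})$. The sketch below computes it by hand on toy numbers (not the study data); `paired_t` is a hypothetical helper, and the p-value step, which needs the t distribution, is left to `stats.ttest_rel`.

```python
import math
import statistics


def paired_t(before, after):
    """t statistic for H0: mean(before - after) = 0."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n))


# Toy paired observations: the differences are 1, 2 and 3
t_stat = paired_t([11, 22, 33], [10, 20, 30])
```

A positive statistic means the first series is larger on average, which is why the 2019 values are passed first together with `alternative='greater'` when testing for a decrease.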

ECONOMIC IMPACT OF COVID ON GDP

Given that there is significant evidence that CO2 emissions fell during the pandemic period, it is worth investigating whether GDP experienced the same decline. Since we are looking into the relationship between GDP and CO2 emissions, we would expect both to decline if they are positively correlated in some way.

One-Tailed t-test on Paired Observations
$\alpha = 0.05$

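NULL HYPOTHESIS
$\mu_{D} = d_{0}$
The mean paired difference in GDP (2019 minus 2020), $\mu_{D}$, equals $d_{0} = 0$. There is no significant difference in GDP between 2019 and 2020.

ALTERNATIVE HYPOTHESIS
$\mu_{D} > d_{0}$
The mean paired difference is greater than zero. There is a significant decrease in GDP between 2019 and 2020.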

In [30]:
result = stats.ttest_rel(df_gdp_2019,
                         df_gdp_2020,
                         alternative='greater')
In [31]:
print(f'The paired observation t-score is: {result.statistic:.2f}')
print(f'The p-value is: {result.pvalue:.4f}')
The paired observation t-score is: 3.02
The p-value is: 0.0014

We reject the null hypothesis that there is no difference in GDP of countries between 2019 and 2020 (p = 0.0014, α = 0.05). Therefore, there is sufficient evidence to suggest that there is a significant decrease in GDP of countries from 2019 to 2020, during the pandemic.

Since both CO2 emissions and GDP decreased during the pandemic period, the results are consistent with the assumption that they are positively correlated. However, there could be latent factors at play. The Covid-19 situation is unique, and the behavior of GDP and CO2 emissions during this period may not be consistent with their normal patterns outside the pandemic. Further investigation can reveal more insights into the relationship between GDP and CO2 emissions.

CO2 EMISSIONS OF COUNTRIES PER INCOME BRACKET

The Paris Agreement on Climate Change has one key aspect in its framework: financing. Developed countries are expected to provide financial assistance to developing countries, which are the most affected by the severe consequences of climate change. This provision relies on one key assumption: that developed countries are the greatest contributors to greenhouse gases. To test this, we can perform ANOVA on CO2 emissions per capita to check whether there are significant differences among income brackets.

INCOME BRACKETS:

| Category     | GNI per capita (US$) | Examples                              |
|--------------|----------------------|---------------------------------------|
| HIGH         | $\gt$ 12,535         | United States, Singapore, Switzerland |
| UPPER-MIDDLE | 4,046 - 12,535       | Mexico, Thailand, China               |
| LOWER-MIDDLE | 1,036 - 4,045        | Philippines, India, Ukraine           |
| LOW          | $\lt$ 1,036          | Ethiopia, Syria, Afghanistan          |

Table 12. Income Brackets per Country based on GNI
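
The bracket boundaries can be expressed as a small lookup. This is a sketch under stated assumptions: the function name `income_bracket` is hypothetical, the codes H, UM, LM, and L follow the mapper file described later in the notebook, the LOW bracket is assumed to be everything below the lower-middle floor of 1,036, and the GNI figures passed in are arbitrary.

```python
def income_bracket(gni_per_capita):
    """Map GNI per capita (US$) to the bracket codes H, UM, LM, L."""
    if gni_per_capita > 12_535:
        return "H"
    if gni_per_capita >= 4_046:
        return "UM"
    if gni_per_capita >= 1_036:
        return "LM"
    return "L"  # assumption: LOW is everything below 1,036


# Arbitrary values chosen to land in each bracket, including two boundaries
codes = [income_bracket(g) for g in (50_000, 12_535, 4_045, 500)]
```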

One-Way ANOVA
$\alpha = 0.05$

NULL HYPOTHESIS
$\mu_{h}$ = $\mu_{um}$ = $\mu_{lm}$ = $\mu_{l}$
There is no significant difference in CO2 emission per capita of the four income groups

ALTERNATIVE HYPOTHESIS
$Not\,all\,means\,are\,equal.$
At least one of the income groups has a significantly different CO2 emission per capita compared to the other income groups.

In [32]:
# Function to select random samples within each group

def sample_from_group(group, num_samples):
    return (group.sample(n=num_samples, replace=False, random_state=23)
            if len(group) >= num_samples
            else group)


# Apply the function to create equally sized sets of random samples
random_samples = df_epc_2019.groupby(
    'Income Bracket', group_keys=False).apply(sample_from_group, num_samples=20)
In [33]:
# ANOVA using pingouin
# Perform one-way ANOVA
anova_result = pg.anova(data=random_samples,
                        dv='CO2 per capita', between='Income Bracket')

# Print results
print(anova_result, '\n')

# Interpret results
if anova_result['p-unc'].iloc[0] < 0.05:
    print('Reject the null hypothesis. There is a significant '
          'difference in at least one group.')
else:
    print('Fail to reject the null hypothesis. '
          'There is no significant difference in group means.')
# anova_result
           Source  ddof1  ddof2          F         p-unc       np2
0  Income Bracket      3     76  35.228643  2.236668e-14  0.581696 

Reject the null hypothesis. There is a significant difference in at least one group.
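
The `np2` column is the partial eta-squared effect size. As a quick arithmetic check, it can be recovered from the reported F statistic and degrees of freedom with $\eta_{p}^{2} = F \cdot df_{1} / (F \cdot df_{1} + df_{2})$. The helper name below is hypothetical.

```python
def partial_eta_squared(f_stat, df_between, df_within):
    """Effect size implied by an F statistic and its degrees of freedom."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


# F, ddof1 and ddof2 as printed in the ANOVA table
np2 = partial_eta_squared(35.228643, 3, 76)
```

The result agrees with the printed `np2`: in this sample, roughly 58% of the variance in CO2 per capita is associated with income bracket.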
In [34]:
# Tukey HSD post-hoc test: which income brackets differ in CO2 per capita
tukey_results = pairwise_tukeyhsd(
    random_samples['CO2 per capita'], random_samples['Income Bracket'])
print(tukey_results)
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
=====================================================
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     H      L  -8.0927    0.0 -10.3371 -5.8483   True
     H     LM  -6.9957    0.0  -9.2402 -4.7513   True
     H     UM  -4.8194    0.0  -7.0638  -2.575   True
     L     LM   1.0969  0.576  -1.1475  3.3414  False
     L     UM   3.2733 0.0015   1.0289  5.5177   True
    LM     UM   2.1764 0.0607  -0.0681  4.4208  False
-----------------------------------------------------

The Tukey HSD results show four significant pairwise differences in the means. Consecutive brackets do not differ significantly, except where the High Income bracket is involved: High Income countries have significantly greater CO2 emissions per capita than every other bracket. This supports the basis of the Paris Agreement's financial aid provision.

| INCOME BRACKET PAIR                    | SIGNIFICANT DIFFERENCE |
|----------------------------------------|------------------------|
| HIGH - UPPER MIDDLE                    | YES                    |
| UPPER MIDDLE - LOWER MIDDLE            | NO                     |
| LOWER MIDDLE - LOW                     | NO                     |
| NON-CONSECUTIVE PAIRS (ex. HIGH - LOW) | YES                    |

Table 13. Results for Subsequent Income Brackets using ANOVA

Regression Analysis

FUNCTIONS for Class Pipeline

In [35]:
class RegressionAnalysis:
    def __init__(self, df):
        self.X = df.iloc[:, 1:].astype(float)
        self.y = df.iloc[:, 0].astype(float)
        self.model = None

    def fit_regression(self):
        # Add the constant (alpha) to the regression model
        self.X = sm.add_constant(self.X)

        # Fit the regression model using OLS (Ordinary Least Squares)
        self.model = sm.OLS(self.y, self.X).fit()

        # Print the model summary statistics
        print(self.model.summary(), end="\n\n")

        # Get and print the coefficients
        coefficients = self.model.params
        print("=" * 100)
        print("Coefficients")
        print(coefficients)

    def plots(self):
        # Create subplots
        fig, axs = plt.subplots(1, 3, figsize=(18, 6))

        # Residuals vs Fitted Values plot
        axs[0].scatter(self.model.fittedvalues, self.model.resid)
        axs[0].set_xlabel("Fitted Values")
        axs[0].set_ylabel("Residuals")
        axs[0].set_title("Residuals vs Fitted Values")

        # Observed vs Fitted Values plot
        axs[1].scatter(self.model.fittedvalues, self.y)
        axs[1].set_xlabel("Fitted Values")
        axs[1].set_ylabel("Observed")
        axs[1].set_title("Observed vs Fitted Values")

        # Normal Q-Q plot
        sm.qqplot(self.model.resid, line="r", ax=axs[2])
        axs[2].set_title("Normal Q-Q Plot")

        # Adjust layout
        plt.tight_layout()

        # Show the combined figure
        plt.show()

    def stepwise_selection(self):
        included = []
        while True:
            excluded = list(set(self.X.columns) - set(included))
            new_pval = pd.Series(index=excluded, dtype=float)
            for new_column in excluded:
                # add_constant leaves the data unchanged when a constant
                # column (added in fit_regression) is already present
                model = sm.OLS(
                    self.y,
                    sm.add_constant(pd.DataFrame(
                        self.X[included + [new_column]])),
                ).fit()
                new_pval[new_column] = model.pvalues[new_column]
            best_pval = new_pval.min()
            if best_pval < 0.05:
                best_feature = new_pval.idxmin()
                included.append(best_feature)
            else:
                break
        print("=" * 149, end="\n")
        print("Results of the stepwise selection:", included, end="\n\n")

    def check_multicollinearity(self):
        vif_data = pd.DataFrame()
        vif_data["Variable"] = self.X.columns
        vif_data["VIF"] = [
            variance_inflation_factor(self.X.values, i) 
            for i in range(self.X.shape[1])
        ]

        # Check for variables with high VIF
        print("=" * 117)
        print("Results of VIF:\n", vif_data)

Full Model Regression

This involves all countries across 2019 and 2020.

Full Model Regression (2019)

In [36]:
# Full model for 2019 involving all countries
reg_2019_all = RegressionAnalysis(df_all_2019)
reg_2019_all.fit_regression()
reg_2019_all.plots()
reg_2019_all.stepwise_selection()
reg_2019_all.check_multicollinearity()
                             OLS Regression Results                             
================================================================================
Dep. Variable:     CO2 Emissions (2019)   R-squared:                       0.955
Model:                              OLS   Adj. R-squared:                  0.953
Method:                   Least Squares   F-statistic:                     460.4
Date:                  Wed, 06 Dec 2023   Prob (F-statistic):           1.03e-43
Time:                          02:14:28   Log-Likelihood:                -967.88
No. Observations:                    69   AIC:                             1944.
Df Residuals:                        65   BIC:                             1953.
Df Model:                             3                                         
Covariance Type:              nonrobust                                         
======================================================================================================
                                         coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
const                               -8.54e+04   4.01e+04     -2.130      0.037   -1.65e+05   -5333.787
GDP (2019)                          1.135e-07   2.06e-08      5.508      0.000    7.23e-08    1.55e-07
Population (2019)                      0.0013      0.000      5.549      0.000       0.001       0.002
Total Renewable Energy Consumption  3415.3423    306.316     11.150      0.000    2803.588    4027.097
==============================================================================
Omnibus:                       57.932   Durbin-Watson:                   2.108
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              389.392
Skew:                          -2.317   Prob(JB):                     2.78e-85
Kurtosis:                      13.675   Cond. No.                     3.40e+12
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.4e+12. This might indicate that there are
strong multicollinearity or other numerical problems.

====================================================================================================
Coefficients
const                                -8.539921e+04
GDP (2019)                            1.134794e-07
Population (2019)                     1.329198e-03
Total Renewable Energy Consumption    3.415342e+03
dtype: float64
[Figure: Residuals vs Fitted Values, Observed vs Fitted Values, and Normal Q-Q Plot]
=====================================================================================================================================================
Results of the stepwise selection: ['Total Renewable Energy Consumption', 'Population (2019)', 'GDP (2019)', 'const']

=====================================================================================================================
Results of VIF:
                              Variable       VIF
0                               const  1.168443
1                          GDP (2019)  2.682763
2                   Population (2019)  2.320713
3  Total Renewable Energy Consumption  4.217482

Full Model Regression (2020)

In [37]:
# Full model for 2020 involving all countries
reg_2020_all = RegressionAnalysis(df_all_2020)
reg_2020_all.fit_regression()
reg_2020_all.plots()
reg_2020_all.stepwise_selection()
reg_2020_all.check_multicollinearity()
                             OLS Regression Results                             
================================================================================
Dep. Variable:     CO2 Emissions (2020)   R-squared:                       0.953
Model:                              OLS   Adj. R-squared:                  0.950
Method:                   Least Squares   F-statistic:                     435.0
Date:                  Wed, 06 Dec 2023   Prob (F-statistic):           6.03e-43
Time:                          02:14:28   Log-Likelihood:                -969.46
No. Observations:                    69   AIC:                             1947.
Df Residuals:                        65   BIC:                             1956.
Df Model:                             3                                         
Covariance Type:              nonrobust                                         
======================================================================================================
                                         coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
const                              -8.535e+04   4.09e+04     -2.084      0.041   -1.67e+05   -3568.567
GDP (2020)                          7.922e-08   2.23e-08      3.547      0.001    3.46e-08    1.24e-07
Population (2020)                      0.0012      0.000      4.824      0.000       0.001       0.002
Total Renewable Energy Consumption  3588.6322    301.187     11.915      0.000    2987.121    4190.144
==============================================================================
Omnibus:                       58.260   Durbin-Watson:                   2.102
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              381.445
Skew:                          -2.352   Prob(JB):                     1.48e-83
Kurtosis:                      13.514   Cond. No.                     3.34e+12
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.34e+12. This might indicate that there are
strong multicollinearity or other numerical problems.

====================================================================================================
Coefficients
const                                -8.534701e+04
GDP (2020)                            7.921794e-08
Population (2020)                     1.170253e-03
Total Renewable Energy Consumption    3.588632e+03
dtype: float64
[Figure: Residuals vs Fitted Values, Observed vs Fitted Values, and Normal Q-Q Plot]
=====================================================================================================================================================
Results of the stepwise selection: ['Total Renewable Energy Consumption', 'Population (2020)', 'GDP (2020)', 'const']

=====================================================================================================================
Results of VIF:
                              Variable       VIF
0                               const  1.164342
1                          GDP (2020)  2.930309
2                   Population (2020)  2.301857
3  Total Renewable Energy Consumption  4.493000

Interpretation of the Full Model Regression Analysis

Using a 5% significance level, all three predictors in the full model involving all countries (GDP, population, and total renewable energy consumption) contribute significantly to explaining the variability of CO2 emissions in both years. Moreover, the predictors are only moderately correlated with each other: every VIF is below 5 in both 2019 and 2020. For both years, as well as in the succeeding regression models, a constant was added to the model to account for the extraneous variables that also contribute to CO2 emissions but were not considered in the regression analysis.
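
The VIF values have a direct interpretation: $VIF_{j} = 1 / (1 - R_{j}^{2})$, where $R_{j}^{2}$ comes from regressing predictor $j$ on the other predictors. The sketch below inverts that identity for the largest VIF reported for 2019; the function name is hypothetical.

```python
def r_squared_from_vif(vif):
    """Invert VIF = 1 / (1 - R^2) to get the auxiliary R^2."""
    return 1.0 - 1.0 / vif


# Largest VIF in the 2019 full model (Total Renewable Energy Consumption)
r2_renewable = r_squared_from_vif(4.217482)
```

Here roughly three quarters of the variation in renewable energy consumption is explained by GDP and population, which is noticeable but still under the common rule-of-thumb cutoffs of 5 or 10 for VIF.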

Isolated Regression

Developed Countries vs. Developing Countries

This section provides a preliminary investigation of the Environmental Kuznets Curve (EKC) Theory, a hypothesis that suggests that environmental degradation initially increases when economic expansion occurs. However, at a certain point, a society starts to improve its relationship with the environment and environmental degradation levels start to decline. This phenomenon is best described by Figure 6.

ekc.png

Figure 6. The Environmental Kuznets Curve Theory

Under the EKC, the relationship between economic development and environmental degradation is analyzed longitudinally. As such, cointegration tests are used to determine the long-term relationship between the two variables. This is in contrast to correlation analysis, which focuses on a much shorter timeframe. Given the limited scope of our discussion, a correlation analysis through regression was used to take an initial look at the EKC theory.

We run separate regressions on developing countries and developed countries. We surmise that the EKC is present in a particular year if the slope of the regression line for the developing countries is steeper than that of the developed countries.
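
The comparison rule can be sketched as follows. This is a simplified illustration, not the notebook's pipeline: it fits a one-predictor least-squares line per group on invented numbers, whereas the study uses multiple regression, and the names `ols_slope` and `ekc_signal` are hypothetical.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def ekc_signal(slope_developing, slope_developed):
    """True when emissions rise faster with GDP in developing countries."""
    return slope_developing > slope_developed


# Invented (GDP, CO2) pairs purely for illustration
developing_slope = ols_slope([1, 2, 3, 4], [2, 4, 6, 8])        # CO2 = 2 * GDP
developed_slope = ols_slope([10, 20, 30, 40], [5, 10, 15, 20])  # CO2 = 0.5 * GDP
supports_ekc = ekc_signal(developing_slope, developed_slope)
```

A formal comparison would also need the standard errors of the two slopes; comparing point estimates alone is only a first look.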

Developing vs. Developed

The mapper file contains countries and their designated income bracket:

  • L: Low income
  • LM: Lower-middle income
  • UM: Upper-middle income
  • H: High income

In this study, developing countries include low-income, lower-middle income, and upper-middle income countries, while developed countries include high-income countries.
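
A minimal sketch of this grouping, assuming the mapper can be read as a country-to-bracket dictionary; the names `DEVELOPMENT_GROUP` and `split_by_development` and the four-country excerpt are hypothetical.

```python
# Bracket codes as listed in the mapper file description
DEVELOPMENT_GROUP = {
    "L": "Developing",
    "LM": "Developing",
    "UM": "Developing",
    "H": "Developed",
}


def split_by_development(brackets):
    """Group country codes by development status given {country: bracket}."""
    groups = {"Developing": [], "Developed": []}
    for country, bracket in brackets.items():
        groups[DEVELOPMENT_GROUP[bracket]].append(country)
    return groups


# Hypothetical excerpt of the mapper; brackets follow the Table 12 examples
example = split_by_development(
    {"USA": "H", "CHN": "UM", "PHL": "LM", "ETH": "L"})
```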

Regression Analysis of Developed Countries (2019 and 2020)

In [38]:
# Regression analysis for the developed countries (2019)
reg_developed_2019 = RegressionAnalysis(developed_2019)
reg_developed_2019.fit_regression()
reg_developed_2019.plots()
reg_developed_2019.stepwise_selection()
reg_developed_2019.check_multicollinearity()
                             OLS Regression Results                             
================================================================================
Dep. Variable:     CO2 Emissions (2019)   R-squared:                       0.975
Model:                              OLS   Adj. R-squared:                  0.973
Method:                   Least Squares   F-statistic:                     501.5
Date:                  Wed, 06 Dec 2023   Prob (F-statistic):           1.37e-30
Time:                          02:14:29   Log-Likelihood:                -549.59
No. Observations:                    42   AIC:                             1107.
Df Residuals:                        38   BIC:                             1114.
Df Model:                             3                                         
Covariance Type:              nonrobust                                         
======================================================================================================
                                         coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
const                              -1.072e+04    2.4e+04     -0.447      0.658   -5.93e+04    3.79e+04
GDP (2019)                          2.406e-07   3.14e-08      7.662      0.000    1.77e-07    3.04e-07
Population (2019)                     -0.0007      0.002     -0.407      0.686      -0.004       0.003
Total Renewable Energy Consumption   215.3535    319.536      0.674      0.504    -431.513     862.220
==============================================================================
Omnibus:                       11.215   Durbin-Watson:                   1.951
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               29.598
Skew:                          -0.277   Prob(JB):                     3.74e-07
Kurtosis:                       7.075   Cond. No.                     4.21e+12
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.21e+12. This might indicate that there are
strong multicollinearity or other numerical problems.

====================================================================================================
Coefficients
const                                -1.072026e+04
GDP (2019)                            2.406131e-07
Population (2019)                    -6.928047e-04
Total Renewable Energy Consumption    2.153535e+02
dtype: float64
[Figure: Residuals vs Fitted Values, Observed vs Fitted Values, and Normal Q-Q Plot]
=====================================================================================================================================================
Results of the stepwise selection: ['GDP (2019)']

=====================================================================================================================
Results of VIF:
                              Variable        VIF
0                               const   1.610832
1                          GDP (2019)  26.547516
2                   Population (2019)  23.569251
3  Total Renewable Energy Consumption   4.354519
In [39]:
# Regression analysis for the developed countries (2020)
reg_developed_2020 = RegressionAnalysis(developed_2020)
reg_developed_2020.fit_regression()
reg_developed_2020.plots()
reg_developed_2020.stepwise_selection()
reg_developed_2020.check_multicollinearity()
                             OLS Regression Results                             
================================================================================
Dep. Variable:     CO2 Emissions (2020)   R-squared:                       0.976
Model:                              OLS   Adj. R-squared:                  0.974
Method:                   Least Squares   F-statistic:                     508.6
Date:                  Wed, 06 Dec 2023   Prob (F-statistic):           1.05e-30
Time:                          02:14:30   Log-Likelihood:                -544.84
No. Observations:                    42   AIC:                             1098.
Df Residuals:                        38   BIC:                             1105.
Df Model:                             3                                         
Covariance Type:              nonrobust                                         
======================================================================================================
                                         coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
const                              -5820.4427   2.15e+04     -0.271      0.788   -4.94e+04    3.77e+04
GDP (2020)                          2.141e-07   2.85e-08      7.518      0.000    1.56e-07    2.72e-07
Population (2020)                   1.603e-05      0.001      0.011      0.991      -0.003       0.003
Total Renewable Energy Consumption   124.0802    278.209      0.446      0.658    -439.124     687.285
==============================================================================
Omnibus:                       10.671   Durbin-Watson:                   1.935
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               28.324
Skew:                           0.176   Prob(JB):                     7.07e-07
Kurtosis:                       7.008   Cond. No.                     4.09e+12
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.09e+12. This might indicate that there are
strong multicollinearity or other numerical problems.

====================================================================================================
Coefficients
const                                -5.820443e+03
GDP (2020)                            2.140910e-07
Population (2020)                     1.602640e-05
Total Renewable Energy Consumption    1.240802e+02
dtype: float64
=====================================================================================================================================================
Results of the stepwise selection: ['GDP (2020)']

=====================================================================================================================
Results of VIF:
                              Variable        VIF
0                               const   1.621499
1                          GDP (2020)  25.742180
2                   Population (2020)  22.567237
3  Total Renewable Energy Consumption   4.782371

Interpretation

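In the regression models for developed countries, GDP is the only predictor that contributes significantly to explaining the variance in CO2 emissions, and stepwise selection retains GDP alone in both years. The full models explain 97.8% of the variance in 2019 and 97.6% in 2020. Including population and total renewable energy consumption did not alter the R-squared, and neither variable is significant in the 2020 output (p = 0.991 and p = 0.658), so they add no meaningful information beyond GDP. The VIFs above 20 for GDP and population, together with the large condition number, also point to strong multicollinearity between these two predictors, so their individual coefficients should be read with caution.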
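
One way to check the claim that extra predictors add no explanatory power is to compare a GDP-only model with the full model on R-squared and adjusted R-squared. The sketch below is a minimal illustration using NumPy on synthetic data. The helper names `r_squared` and `adjusted_r_squared` and all generated values are assumptions for illustration; they are not part of the notebook's `RegressionAnalysis` class or its dataset.

```python
import numpy as np


def r_squared(X, y):
    """R-squared of an OLS fit of y on X (an intercept column is added here)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared, which penalizes each additional predictor."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Synthetic, illustrative data only (NOT the notebook's dataset).
rng = np.random.default_rng(0)
n = 42
gdp = rng.lognormal(mean=0.0, sigma=1.0, size=n)
population = 0.9 * gdp + rng.normal(scale=0.1, size=n)  # collinear with GDP
renewables = rng.normal(size=n)                          # unrelated to CO2 here
co2 = 2.0 * gdp + rng.normal(scale=0.3, size=n)

r2_gdp = r_squared(gdp.reshape(-1, 1), co2)
r2_full = r_squared(np.column_stack([gdp, population, renewables]), co2)

print("GDP only: R2 =", round(r2_gdp, 4),
      "adj R2 =", round(adjusted_r_squared(r2_gdp, n, 1), 4))
print("Full:     R2 =", round(r2_full, 4),
      "adj R2 =", round(adjusted_r_squared(r2_full, n, 3), 4))
```

Plain R-squared can never decrease when predictors are added to a nested model, so adjusted R-squared is the fairer comparison. As a check on the formula, `adjusted_r_squared(0.976, 42, 3)` gives about 0.974, which matches the 2020 summary above.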

Regression Analysis of Developing Countries (2019 and 2020)

In [40]:
# Regression analysis for the developing countries (2019)
reg_developing_2019 = RegressionAnalysis(developing_2019)
reg_developing_2019.fit_regression()
reg_developing_2019.plots()
reg_developing_2019.stepwise_selection()
reg_developing_2019.check_multicollinearity()
                             OLS Regression Results                             
================================================================================
Dep. Variable:     CO2 Emissions (2019)   R-squared:                       0.991
Model:                              OLS   Adj. R-squared:                  0.990
Method:                   Least Squares   F-statistic:                     877.0
Date:                  Wed, 06 Dec 2023   Prob (F-statistic):           7.58e-24
Time:                          02:14:30   Log-Likelihood:                -366.39
No. Observations:                    27   AIC:                             740.8
Df Residuals:                        23   BIC:                             746.0
Df Model:                             3                                         
Covariance Type:              nonrobust                                         
======================================================================================================
                                         coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
const                              -1.116e+05   4.43e+04     -2.518      0.019   -2.03e+05   -1.99e+04
GDP (2019)                          9.485e-07   1.06e-07      8.952      0.000    7.29e-07    1.17e-06
Population (2019)                      0.0003      0.000      1.407      0.173      -0.000       0.001
Total Renewable Energy Consumption -1679.9926    724.665     -2.318      0.030   -3179.077    -180.908
==============================================================================
Omnibus:                       16.476   Durbin-Watson:                   2.426
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.481
Skew:                          -1.660   Prob(JB):                     0.000160
Kurtosis:                       5.125   Cond. No.                     3.20e+12
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.2e+12. This might indicate that there are
strong multicollinearity or other numerical problems.

====================================================================================================
Coefficients
const                                -1.115589e+05
GDP (2019)                            9.485305e-07
Population (2019)                     2.756669e-04
Total Renewable Energy Consumption   -1.679993e+03
dtype: float64
=====================================================================================================================================================
Results of the stepwise selection: ['GDP (2019)', 'Total Renewable Energy Consumption', 'const']

=====================================================================================================================
Results of VIF:
                              Variable        VIF
0                               const   1.260093
1                          GDP (2019)  51.923447
2                   Population (2019)   3.100136
3  Total Renewable Energy Consumption  44.659468

Interpretation

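In the regression models applied to developing countries, GDP is again the strongest and most significant predictor of CO2 emissions, with the models explaining 99.1% of the variance in 2019 and 99.2% in 2020. In the 2019 output above, total renewable energy consumption is also significant at the 5% level (coefficient -1679.99, p = 0.030) and is retained by stepwise selection alongside GDP, while population is not significant (p = 0.173). The VIFs for GDP (51.9) and renewable energy consumption (44.7) are far above the common threshold of 10, so these two predictors are strongly collinear and their individual coefficients should be interpreted with caution.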
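
The VIF values reported above can be reproduced by hand: regress each predictor on the remaining predictors and compute 1 / (1 - R²). The sketch below is a minimal NumPy version on synthetic data, offered as an illustration rather than the notebook's `check_multicollinearity` method. The function name `vif` and the generated columns are assumptions, and unlike the table above it does not report a value for the constant.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (predictors only, no constant)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression of predictor j on an intercept plus the other predictors.
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic, illustrative data only (NOT the notebook's dataset).
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1, so both get a high VIF
x3 = rng.normal(size=n)             # independent, so its VIF stays near 1

vif_values = vif(np.column_stack([x1, x2, x3]))
for name, value in zip(["x1", "x2", "x3"], vif_values):
    print(name, round(float(value), 2))
```

A VIF near 1 means a predictor is almost unrelated to the others. Values above about 10 are a common rule-of-thumb warning that coefficient estimates and their p-values may be unstable.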

Status of the EKC Theory in 2019 and 2020

Across the regression models for developed and developing countries in 2019 and 2020, GDP was consistently among the significant variables and best explained the variability in CO2 emissions. In both groups, higher national income comes at the expense of environmental degradation, and the effect is stronger for developing countries: the 2019 GDP coefficient for developing countries (9.49e-07) is roughly four times that for developed countries (2.41e-07). In terms of Figure 1, this places both developed and developing countries on the upward-sloping part of the EKC.
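
A common way to locate a country group on the EKC is to add a squared income term: fit ln(emissions) = b0 + b1 ln(income) + b2 ln(income)², where an inverted U requires b1 > 0 and b2 < 0, and the turning point is at income = exp(-b1 / (2 b2)). The sketch below illustrates this on synthetic data with a known turning point. It is not an analysis performed in this study, and the function name `ekc_turning_point` and all generated values are assumptions.

```python
import numpy as np


def ekc_turning_point(income, emissions):
    """Fit ln(E) = b0 + b1*ln(Y) + b2*ln(Y)^2 and return (b0, b1, b2, turning_point).

    turning_point is None when the fitted curve is not an inverted U (b2 >= 0).
    """
    x = np.log(np.asarray(income, dtype=float))
    y = np.log(np.asarray(emissions, dtype=float))
    b2, b1, b0 = np.polyfit(x, y, deg=2)  # coefficients come highest power first
    if b2 >= 0:
        return float(b0), float(b1), float(b2), None
    return float(b0), float(b1), float(b2), float(np.exp(-b1 / (2.0 * b2)))


# Synthetic, illustrative data only: true turning point at ln(income) = 10.
rng = np.random.default_rng(2)
log_income = rng.uniform(6.0, 12.0, size=200)
log_emissions = (-18.0 + 4.0 * log_income - 0.2 * log_income ** 2
                 + rng.normal(scale=0.05, size=200))

b0, b1, b2, turning_point = ekc_turning_point(np.exp(log_income),
                                              np.exp(log_emissions))
print("b1 =", round(b1, 3), "b2 =", round(b2, 3))
print("estimated turning point income:", round(turning_point))
```

Under this specification, observations with income below the estimated turning point lie on the upward-sloping part of the curve. A purely linear model, as used above, can show the sign of the slope but cannot estimate where a downturn would begin.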

CONCLUSION

The study shows a clear link between a country's GDP and its CO2 emissions. High-income countries tend to have higher CO2 emissions per capita, supporting the idea that wealthier nations contribute more to global emissions. The regression analysis found GDP to be the key predictor of CO2 emissions in both developed and developing countries: in developed countries it was the only significant predictor, while in developing countries its estimated effect on emissions was even stronger. This is consistent with the rising segment of the Environmental Kuznets Curve (EKC), where economic growth in poorer countries initially leads to more environmental harm. However, emissions also rise with GDP in developed countries, so the results show no sign of the downturn the EKC predicts at higher incomes, and economic growth alone might not lead to environmental improvement. This highlights the need for targeted policies to ensure that economic development goes hand in hand with environmental sustainability.

RECOMMENDATIONS

Policy Recommendations

  1. Differential Carbon Taxes: Implement a system where high-income countries are subject to higher carbon tax rates, while providing lower rates or subsidies for developing countries. This approach aims to balance economic growth with environmental responsibility.

  2. Technology Transfer to Developing Countries: The regression for developing countries found CO2 emissions to be inversely related to renewable energy consumption (for example, a coefficient of -1679.99 with p = 0.030 in 2019). Assisting developing countries with high CO2 emissions in adopting renewable energy technology is therefore a logical step. Developed nations can contribute by sharing their technological advancements and providing guidance on implementation.

  3. International Standards for Sustainable Development: Establish and enforce global standards for sustainable development, with specific targets for CO2 emission reduction and renewable energy adoption. Compliance could be encouraged through international benefits and recognition.

References

Pettinger, T. (2019, September 11). Environmental Kuznets curve - Economics Help. Economics Help. https://www.economicshelp.org/blog/14337/environment/environmental-kuznets-curve/

United Nations. (n.d.). The Paris Agreement. United Nations. https://www.un.org/en/climatechange/paris-agreement#:~:text=The%20Agreement%20is%20a%20legally

UNFCCC. (2015). The Paris Agreement. United Nations Climate Change; United Nations. https://unfccc.int/process-and-meetings/the-paris-agreement